Abstract. One of the most famous and investigated lossless data-compression schemes
is the one introduced by Lempel and Ziv about 40 years ago [23]. This compression
scheme is known as "dictionary-based compression" and consists of squeezing an input
string by replacing some of its substrings with (shorter) codewords which are actually
pointers to a dictionary of phrases built as the string is processed. Surprisingly
enough, although many fundamental results are nowadays known about upper bounds
on the speed and effectiveness of this compression process (see e.g. [12, 16] and
references therein), "we are not aware of any parsing scheme that achieves optimality when
the LZ77-dictionary is in use under any constraint on the codewords other than being
of equal length" [16, pag. 159]. Here optimality means to achieve the minimum number
of bits in compressing each individual input string, without any assumption on its
generating source. In this paper we provide the first LZ-based compressor which computes
the bit-optimal parsing of any input string in efficient time and optimal space,
for a general class of variable-length codeword encodings which encompasses most of
the ones typically used in data compression and in the design of search engines and
compressed indexes [14, 17, 22].

1 Introduction

The problem of lossless data compression consists of compactly representing data in a format
that can be faithfully recovered from the compressed file. Lossless compression is achieved by
taking advantage of the redundancy which is often present in the data generated by either 
humans or machines. One of the most famous lossless data-compression schemes is the one
introduced by Lempel and Ziv in the late 70s [23, 24]. It has been "the solution" to lossless
compression for nearly 15 years, and indeed many (non-)commercial programs are currently
based on it, like gzip, zip, pkzip, arj, rar, just to cite a few. This compression scheme is
known as dictionary-based compression, and consists of squeezing an input string S[1, n] by
replacing some of its substrings (phrases) with (shorter) codewords which are actually pointers 
to a dictionary being either static (in that it has been constructed before the compression 
starts) or dynamic (in that it is built as the input string is compressed). The well-known LZ77 
and LZ78 compressors, proposed by Lempel and Ziv in [23, 24], and all their numerous variants 
[17], are interesting examples of dynamic dictionary-based compression algorithms. In LZ77, 
and its variants, the dictionary consists of all substrings occurring in the previously scanned 
portion of the input string, and each codeword consists of a triple (d, ℓ, c) where d is the
relative offset of the copied phrase, ℓ is its length and c is the single (new) character following
it. In LZ78, the dictionary is built upon phrases extracted from the previously scanned prefix
of the input string, and each codeword consists of a pair (id, c) where id is the identifier
of the copied phrase in the dictionary and c is the character following that phrase in the
subsequent suffix of the string.

Many theoretical and experimental results have been dedicated to LZ-compressors in these 
thirty years (see e.g. [17] and references therein); and, although today there are alternative 
solutions to the problem of lossless data compression potentially offering better compression
bounds (e.g. Burrows-Wheeler compression and Prediction by Partial Matching [22]),
dictionary-based compression is still widely used in everyday applications because of its unique 
combination of compression power and speed. Over the years dictionary-based compression 
has also gained importance as a general algorithmic tool, being employed in the design of 
compressed text indexes [14], or in universal clustering tools [3], or in designing optimal pre- 
fetching mechanisms [21]. 

Surprisingly enough some important problems on the combinatorial properties of the LZ- 
parsing are still open, and they will be the main topic of investigation of this paper. Take the 
classical LZ77 and LZ78 algorithms. They adopt a greedy parsing of the input string: namely, 
at each step, they take the longest dictionary phrase which is a prefix of the currently unparsed 
string suffix. It is well-known [16] that greedy parsing is optimal with respect to the number of 
phrases in which S can be parsed by any suffix-complete dictionary (like LZ77); and a small
variation of it (called flexible-parsing [13]) is optimal for prefix-complete dictionaries (like
LZ78). Of course, the number of parsed phrases influences the compression ratio and, indeed,
various authors [23, 24, 12] proved that greedy parsing achieves asymptotically the (empirical)
entropy of the source generating the input string S. However, these fundamental results have 
not yet closed the problem of optimally compressing S because the optimality in the number 
of parsed phrases is not necessarily equal to the optimality in the number of bits output by 
the final compressor. Clearly, if the phrases are compressed via an equal-length encoder, like 
in [17, 23, 12], then the produced output is bit optimal. But if one aims for higher compression 
by using variable-length encoders for the parsed phrases (see e.g. [22, 7]), then the bit-length
of the compressed output produced by the greedy-parsing scheme is not necessarily optimal. 

As an illustrative example, consider the LZ77-compressor and assume that the copy of the 
ith phrase occurs very far from the current unparsed position, and thus its d-value is large. In
this case it could be more convenient to give up the maximality of that phrase
and split it into (several) smaller phrases which possibly occur closer to that position and thus
can be encoded in fewer bits overall. Several solutions are indeed known for determining the
bit-optimal parsing of S, but they are either inefficient [18, 8], taking $\Theta(n^2)$ time and space
in the worst case, or approximate [9], or they rely on heuristics [11, 19, 2, 4, 8] which do not
provide any guarantee on the time/space performance of the compression process. This is the
reason why Rajpoot and Sahinalp stated in [16, pag. 159] that "We are not aware of any
on-line or off-line parsing scheme that achieves optimality when the LZ77-dictionary is in use
under any constraint on the codewords other than being of equal length".

Motivated by these poor results, we address in this paper the question posed by Rajpoot 
and Sahinalp by investigating a general class of variable-length codeword encodings which are 
typically used in data compression and in the design of search engines and compressed indexes 
[14, 17, 22]. We prove that the classic greedy-parsing scheme deploying these encoders may be 
far from the bit-optimal parsing by a multiplicative factor $\Omega(\log n / \log\log n)$, which is indeed
unbounded asymptotically (Section 2, Lemma 1). This result is obtained by considering an
infinite family of strings S of increasing length n and low empirical entropy, and by showing 
that for these strings copying the longest phrase, as LZ77 does, may be dramatically inefficient. 



We notice that this gap between LZ77 and the bit-optimal compressor strengthens the results
proved by Kosaraju and Manzini in [12], who showed that the compression rate of LZ77
converges asymptotically to the kth order empirical entropy $H_k(S)$ of string S, and this rate
is dominated for low entropy strings by an additive term $O(\frac{\log\log n}{\log n})$ which depends on the
string length n and not on its compressibility $H_k(S)$. These are precisely the strings for which
the bit-optimal parser is much better than LZ77, and thus closer to the entropy of S.

Given these premises, we investigate and design an LZ-based compressor that computes the 
bit-optimal parsing for those variable-length integer encoders in efficient time and optimal 
space, in the worst case (Section 5, Theorem 2). Due to space limitations, we will detail 
our results only for the LZ77-dictionary, and defer the discussion on other dictionary-based 
schemes (like LZ78) to the last Section 6. Technically speaking, we follow [18] and model the
search for a bit-optimal parsing of an input string S[1, n] as a single-source shortest path
problem (shortly, SSSP) on a weighted DAG G(S) consisting of n nodes, one per character of
S, and e edges, one per possible LZ77-parsing step. Every edge is weighted according to the
length in bits of the codeword adopted to compress the corresponding LZ77-phrase. Since
LZ-codewords are tuples of integers (see above), we consider in this paper a class of codeword
encoders which satisfy the so called increasing cost property: the larger is the integer to be
encoded, the longer is the codeword. This class encompasses most of the encoders frequently
used in the literature to design data compressors [7], compressed full-text indexes [14] and
search engines [22]. We prove new combinatorial properties for the SSSP-problem formulated
on the graph G(S) weighted according to these encoding functions and show that, unlike
[18] (for which $e = \Theta(n^2)$ in the worst case), the computation of the SSSP in G(S) can be
restricted onto a subgraph $\widetilde{G}(S)$ whose size is provably smaller than the complete graph (see
Theorem 1). Actually, we show that the size of $\widetilde{G}(S)$ is related to the structural features of
the integer-encoding functions adopted to compress the LZ-phrases (Lemma 3). Finally, we
design an algorithm that computes the SSSP of $\widetilde{G}(S)$ without materializing that subgraph all
at once, but by creating and exploring its edges on-the-fly in optimal O(1) amortized time per
edge and using O(n) optimal space overall. As a result, our LZ77-compressor achieves optimal
compression ratio, by using optimal O(n) working space and taking time proportional to
$|\widetilde{G}(S)|$ (hence, it is optimal in its size).

If the LZ-phrases are encoded with equal-length codewords, our approach is optimal in 
compression ratio and time/space performance, as it is classically known [17]. But if we 
consider the more general (and open) case of the variable-length Elias or Fibonacci codes, as 
it is typical of data compressors and compressed indexes [7, 14], then our approach is optimal
in compression ratio and working-space occupancy, and takes $O(n \log n)$ time in the worst
case. Most other variable-length integer encoders fall in this case too (see e.g. [22]). To the
best of our knowledge, this is the first result providing a positive answer to Rajpoot-Sahinalp's 
question above! 

The final Section 6 will discuss variations and extensions of our approach, considering 
the cases of a bounded compression-window and of other suffix- or prefix-complete dictionary
construction schemes (like LZ78). 

2 On the Bit-Optimality of LZ-parsing

Let S[1, n] be a string drawn from an alphabet Σ of size σ. We will use S[i] to denote the ith
symbol of S; S[i : j] to denote the substring (also called the phrase) extending from the ith
to the jth symbol in S (extremes included); and $S_i = S[i : n]$ to denote the i-th suffix of S.



Dictionary-based compression works in two intermingled phases: parsing and encoding. 
Let $w_1, w_2, \ldots, w_{i-1}$ be the phrases in which a prefix of S has been already parsed. The
parser selects the next phrase $w_i$ as one of the phrases in the current dictionary that prefix
the remaining suffix of S, and possibly attaches to this phrase a few other following symbols
(typically one) in S. This addition is sometimes needed to enrich the dictionary with new
symbols occurring in S and never encountered before in the parsing process. Phrase $w_i$ is
then represented via a proper reference to the dynamic dictionary, which is built during the 
parsing process. The well-known LZ77 and LZ78 compressors [23,24], and all their variants 
[17], are incarnations of the above compression scheme and differentiate themselves mainly 
by the way the dynamic dictionary is built, by the rule adopted to select the next phrase, 
and by the encoding of the dictionary references. 

In the rest of the paper we will concentrate on the LZ77-scheme and defer the discussion 
of LZ78's, and other approaches, to the last Section 6. In LZ77 and its variants, the dictionary 
consists of all substrings starting in the scanned prefix of S,^1 and thus it consists (implicitly)
of a quadratic number of phrases which are (explicitly) represented via a (dynamic) indexing
data structure that takes linear space in |S| and efficiently searches for repeated substrings.
The LZ77-dictionary satisfies two main properties: Prefix Completeness (i.e. for any given
dictionary phrase, all of its prefixes are also dictionary phrases) and Suffix Completeness
(i.e. for any given dictionary phrase, all of its suffixes are also dictionary phrases). At any step,
the LZ77-parser proceeds by adopting the so called longest match heuristic: that is, $w_i$ is taken
as the longest phrase of the current dictionary which prefixes the remaining suffix of S. This
will be hereafter called greedy parsing. The classic LZ77-parser finally adds one further symbol
to $w_i$ (namely, the one following the phrase in S) to form the next phrase of S's parsing.

In the rest of our paper, and without loss of generality, we consider the LZ77-variant
which avoids the additional symbol per phrase, and thus represents the next phrase $w_i$ by
the integer pair $(d_i, \ell_i)$ where $d_i$ is the relative offset of the copied phrase $w_i$ within the prefix
$w_1 \cdots w_{i-1}$ and $\ell_i$ is its length (i.e. $|w_i|$). We notice that every first occurrence of a new symbol
c is encoded as (0, c). Once phrases are identified and represented via pairs of integers, their
components are compressed via variable-length integer encoders which will eventually produce
the compressed output of S as a sequence of bits.
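
As a concrete illustration of the pair-based LZ77 variant described above, the following Python sketch performs the greedy (longest match) parsing by brute force. The function name `greedy_lz77` and the quadratic scan over earlier positions are assumptions made here for brevity; they stand in for the indexing data structure mentioned in the text and are not the paper's implementation. Each copied phrase is emitted as a pair (distance, length), and the first occurrence of a symbol c as (0, c).

```python
def greedy_lz77(s):
    """Greedy LZ77 parsing of s into pairs (d, l), or (0, c) for a new symbol c.

    Brute-force sketch: for every position it scans all earlier start
    positions, so it is only meant for short inputs.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_dist = 0, 0
        for p in range(i):                      # candidate start of the copy
            l = 0
            # The copy may overlap the phrase being built (p + l can reach i).
            while i + l < n and s[p + l] == s[i + l]:
                l += 1
            if l > 0 and l >= best_len:         # ties: keep the right-most copy
                best_len, best_dist = l, i - p
        if best_len == 0:                       # symbol never seen before
            phrases.append((0, s[i]))
            i += 1
        else:
            phrases.append((best_dist, best_len))
            i += best_len
    return phrases


print(greedy_lz77("aaaa"))         # [(0, 'a'), (1, 3)]
print(greedy_lz77("abababab"))     # [(0, 'a'), (0, 'b'), (2, 6)]
```

Note that the second phrase of "aaaa" overlaps its own copy, which the LZ77-dictionary admits.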

In order to study and design bit-optimal parsing schemes, we therefore need to deal with 
integer encoders. Let f be an integer-encoding function that maps any integer $x \in [n]$ into
a (bit-)codeword f(x) whose length (in bits) is denoted by |f(x)|. In this paper we consider
variable-length encodings which use longer codewords for larger integers:

Property 1 (Increasing Cost Property). For any $x, y \in [n]$ it is $x \le y$ iff $|f(x)| \le |f(y)|$.

This property is satisfied by most practical integer encoders [22], as well as by equal-length
codewords, Elias codes (i.e. gamma, delta, and their derivatives [6]), and Fibonacci's codes
[7]. Therefore, this class encompasses all encoders typically used in the literature to design 
data compressors [17], compressed full-text indexes [14] and search engines [22]. 
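
To make the Increasing Cost Property tangible, the sketch below computes the codeword lengths of the Elias gamma and delta codes and checks that they never decrease. The helper names (`gamma_len`, `delta_len`, `is_increasing_cost`) are hypothetical, and encoding x + 1 instead of x is an assumption made here so that the integer 0 can be represented, since the standard Elias codes start from 1.

```python
def gamma_len(x):
    """Bits used by the Elias gamma code of x + 1 (the +1 shift lets us encode 0)."""
    return 2 * (x + 1).bit_length() - 1


def delta_len(x):
    """Bits used by the Elias delta code of x + 1."""
    b = (x + 1).bit_length()            # number of binary digits of x + 1
    return (b - 1) + (2 * b.bit_length() - 1)


def is_increasing_cost(length_fn, n):
    """Check that codeword lengths never decrease over 0, 1, ..., n."""
    return all(length_fn(x) <= length_fn(x + 1) for x in range(n))


print([gamma_len(x) for x in range(8)])   # [1, 3, 3, 5, 5, 5, 5, 7]
print([delta_len(x) for x in range(8)])   # [1, 4, 4, 5, 5, 5, 5, 8]
print(is_increasing_cost(gamma_len, 10**4), is_increasing_cost(delta_len, 10**4))
```

Both length functions are step functions: many integers share the same codeword length, which is exactly what the later counting argument exploits.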

Given two integer-encoders f and g (possibly f = g) which satisfy the Increasing Cost
Property 1, we denote by $LZ_{f,g}(S)$ the compressed output produced by the greedy-parsing
strategy in which we have used f to compress the distance $d_i$, and g to compress the length
$\ell_i$ of any parsed phrase $w_i$. $LZ_{f,g}(S)$ thus encodes any phrase $w_i$ in $|f(d_i)| + |g(\ell_i)|$ bits. We
have already noticed that $LZ_{f,g}(S)$ is not necessarily bit optimal, so we will hereafter use
$OPT_{f,g}(S)$ to denote the (f, g)-optimal parser, namely the one that parses S into a sequence
of phrases which are drawn from the LZ77-dictionary and which minimize the total number
of bits produced by their encoders f and g. Given the above observations it is immediate to
infer that $|LZ_{f,g}(S)| \ge |OPT_{f,g}(S)|$. However this bound does not provide us with any estimate
of how much worse the greedy parsing can be with respect to $OPT_{f,g}(S)$. In what follows we
identify an infinite family of strings for which the compressed output of the greedy parser is a
multiplicative factor $\Omega(\log n / \log\log n)$ worse than the bit-optimal parser. This result shows
that the ratio $|LZ_{f,g}(S)| / |OPT_{f,g}(S)|$ is indeed asymptotically unbounded, and thus poses the serious need
for an (f, g)-optimal parser, as clearly requested by [16].

^1 Notice that it admits the overlapping between the current dictionary and the next phrase to be
constructed.

Our argument holds for any choice of f and g from the family of encoding functions that
represent an integer x with a bit string of size $\Theta(\log x)$ bits. (Thus the well-known Elias' and
Fibonacci's coders belong to this family.) Taking inspiration from the proof of Lemma 4.2 in
[12], we consider the infinite family of strings $S_l = b\,a^l\,c^{2^l}\,ba\,ba^2\,ba^3 \cdots ba^l$, parameterized in
the positive value l. The LZ77-parser partitions $S_l$ as^2

$$(b)\,(a)\,(a^{l-1})\,(c)\,(c^{2^l-1})\,(ba)\,(ba^2)\,(ba^3) \cdots (ba^l)$$

where the symbols forming a parsed phrase have been delimited within a pair of brackets. LZ77
thus copies the latest l phrases from the beginning of $S_l$ and takes at least $l\,|f(2^l)| = \Theta(l^2)$
bits. Let us now consider a more parsimonious parser, called rOPT, which selects the copy of
$ba^{i-1}$ (with $i > 1$) from its immediately previous occurrence:

$$(b)\,(a)\,(a^{l-1})\,(c)\,(c^{2^l-1})\,(b)\,(a)\,(ba)\,(a)\,(ba^2)\,(a) \cdots (ba^{l-1})\,(a)$$

$rOPT(S_l)$ takes $|g(2^l)| + |g(l)| + \sum_{i=2}^{l} [|f(i)| + |g(i)| + |f(0)|] + O(l) = O(l \log l)$ bits. Since
$|OPT_{f,g}(S_l)| \le |rOPT(S_l)|$, we can conclude that

$$\frac{|LZ_{f,g}(S_l)|}{|OPT_{f,g}(S_l)|} \ge \frac{|LZ_{f,g}(S_l)|}{|rOPT(S_l)|} \ge \Theta\left(\frac{l}{\log l}\right) \qquad (1)$$

Since $|S_l| = 2^l + l^2/2 + O(l)$, we have that $l = \Theta(\log |S_l|)$ for sufficiently long strings. Using
this estimate into Inequality 1 we finally obtain:

Lemma 1. There exists an infinite family of strings such that, for any of its elements S, it
is $|LZ_{f,g}(S)| \ge \Theta(\log|S| / \log\log|S|)\,|OPT_{f,g}(S)|$.
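
The gap behind Lemma 1 can be observed numerically. The sketch below compares, for the family of strings defined above, the bits spent by the greedy parser and by the more parsimonious parser on the final blocks ba, ba^2, ..., ba^l only (the preceding phrases are parsed identically by both). Several choices are assumptions made here for illustration: both f and g are the Elias gamma code applied to x + 1, a spelled-out symbol costs |f(0)| plus a flat `LITERAL_BITS`, and each trailing (a) is charged as a copy of the previous symbol rather than via |f(0)| as in the formula above.

```python
def gamma_len(x):
    """Bits of the Elias gamma code of x + 1 (shifted so that 0 is encodable)."""
    return 2 * (x + 1).bit_length() - 1


LITERAL_BITS = 8  # assumed flat cost for spelling out a symbol


def greedy_tail_bits(l, f=gamma_len, g=gamma_len):
    """Bits LZ77 spends on ba, ba^2, ..., ba^l: each is copied from position 0."""
    pos = 1 + l + 2 ** l                 # 0-based start of the first block "ba"
    total = 0
    for i in range(1, l + 1):
        total += f(pos) + g(i + 1)       # distance pos, length |ba^i| = i + 1
        pos += i + 1
    return total


def ropt_tail_bits(l, f=gamma_len, g=gamma_len):
    """Bits spent on the same blocks when parsed as (b)(a)(ba)(a)...(ba^(l-1))(a)."""
    total = 2 * (f(0) + LITERAL_BITS)    # the first (b) and (a), spelled out
    for i in range(2, l + 1):
        total += f(i) + g(i)             # ba^(i-1): copied from i positions back
        total += f(1) + g(1)             # trailing a: copy of the previous symbol
    return total


for l in (8, 16, 32):
    print(l, greedy_tail_bits(l), ropt_tail_bits(l))
```

Under these assumptions the ratio between the two costs grows with l, in line with the $\Theta(l / \log l)$ behaviour stated in Inequality 1.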

On the other hand, we can prove that this lower bound is tight up to a $\log\log|S|$ multiplicative
factor, by easily extending to the LZ77-dictionary and Property 1 a result proved in [10]
for static dictionaries. Precisely, we can show that
$\frac{|LZ_{f,g}(S)|}{|OPT_{f,g}(S)|} \le \frac{|f(|S|)| + |g(|S|)|}{|f(0)| + |g(0)|}$, which is upper
bounded by $O(\log n)$ because $|S| = n$, $|f(n)| = |g(n)| = \Theta(\log n)$ and $|f(0)| = |g(0)| = O(1)$.

^2 Recall the variant of LZ77 we are considering in this paper, which uses just a pair of integers per
phrase, and thus drops the char following that phrase in S.

3 Bit-Optimal Parsing and SSSP-problem

Following [18], we model the design of a bit-optimal LZ77-parsing strategy for a string S as a
Single-Source Shortest Path problem (shortly, SSSP-problem) on a weighted DAG G(S) defined
as follows. Graph G(S) = (V, E) has one vertex per symbol of S plus a dummy vertex $v_{n+1}$,
and its edge set E is defined so that $(v_i, v_j) \in E$ iff (1) $j = i + 1$ or (2) the substring $S[i : j-1]$
occurs in S starting from a (previous) position $p < i$ (clearly $i < j$ and thus G(S) is a DAG).
Every edge $(v_i, v_j)$ is labeled with the pair $(d_{i,j}, \ell_{i,j})$ where

- $d_{i,j} = i - p$ is the distance between $S[i : j-1]$ and the position p of its right-most copy
  in S. We set $d_{i,j} = 0$ whenever $j = i + 1$.
- We set $\ell_{i,j} = j - i$, if $j > i + 1$, or $\ell_{i,j} = S[i]$ otherwise.

It is easy to see that the edges outgoing from $v_i$ denote all possible parsing steps that
can be taken by any parsing strategy which uses a LZ77-dictionary. Hence, there exists a
one-to-one correspondence between paths from $v_1$ to $v_{n+1}$ in G(S) and parsings of the whole
string S: any path $\pi = v_1 \cdots v_{n+1}$ corresponds to the parsing $\mathcal{P}_\pi(S)$ represented by the
phrases labeling the edges traversed by $\pi$. If we weight every edge $(v_i, v_j) \in E$ with an integer
$c(v_i, v_j) = |f(d_{i,j})| + |g(\ell_{i,j})|$ which accounts for the cost of encoding its label (phrase) via
the encoding functions f and g, then the length in bits of the encoded parsing $\mathcal{P}_\pi(S)$ is
equal to the cost of the weighted path $\pi$ in G(S). We have therefore reduced the problem of
determining $OPT_{f,g}(S)$ to the problem of computing the SSSP of G(S) from $v_1$ to $v_{n+1}$.

Given that G(S) is a DAG, its shortest path from $v_1$ to $v_{n+1}$ can be computed in $O(|E|)$
time and space. In the worst case (take e.g. $S = a^n$), this is $\Theta(n^2)$, and thus it is inefficient
and unusable in practice [18, 9]. In what follows we show that the computation of the SSSP
can be actually restricted to a subgraph of G(S) whose size is provably $o(n^2)$ in the worst
case, and typically $O(n \log n)$ for most known integer-encoding functions. Then we will design
efficient algorithms and data structures that will allow us to generate this
subgraph on-the-fly by taking O(1) amortized time per edge and O(n) space overall. These
algorithms will therefore be time-and-space optimal for the subgraph at hand.
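
The quadratic worst case can be checked directly on small inputs. The following sketch counts the edges of the full graph by brute force, using the fact that the edges leaving a vertex reach a contiguous range of vertices whose width is the length of the longest copy available at that position. The function name `count_edges` is hypothetical and the cubic-time scan is only meant for illustration.

```python
def count_edges(s):
    """Number of edges of G(S), found by brute force (illustration only)."""
    n, total = len(s), 0
    for i in range(n):
        longest = 0                          # longest copy starting at i
        for p in range(i):
            l = 0
            while i + l < n and s[p + l] == s[i + l]:
                l += 1
            longest = max(longest, l)
        # Edges reach the next `longest` vertices; the single-symbol edge always exists.
        total += max(longest, 1)
    return total


for n in (10, 50, 100):
    print(n, count_edges("a" * n), count_edges("".join(chr(65 + k % 26) for k in range(n))[:26]))
```

For the unary string of length n the count is 1 + n(n - 1)/2, whereas a string of distinct symbols has exactly one edge per vertex.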

4 A useful, small subgraph of G(S)

We use FS(v) to denote the forward star of a vertex v, namely the set of vertices pointed to
by v in G(S); and we use BS(v) to denote the backward star of v, namely the set of vertices
pointing to v in G(S). By construction of G(S), for any $v_j \in FS(v_i)$ it is $i < j$; so that all of
the edges are oriented rightward, and in fact G(S) is a DAG. We can actually show a stronger
property on the distribution of the indices of the vertices in FS(v) and BS(v), namely that
they form a contiguous range.

Fact 1 Given a vertex $v_i$, it is $FS(v_i) = \{v_{i+1}, \ldots, v_{i+x-1}, v_{i+x}\}$ and $BS(v_i) = \{v_{i-y}, \ldots, v_{i-2}, v_{i-1}\}$.
Note that x, y are smaller than the length of the longest repeated substring in S.
Proof: By definition of $(v_i, v_{i+x})$, string $S[i : i+x-1]$ occurs at some position $p < i$ in S.
Any prefix $S[i : k-1]$ of $S[i : i+x-1]$ also occurs at that position p, thus $v_k \in FS(v_i)$. The
bound on x immediately derives from the definition of $(v_i, v_{i+x})$. A similar argument derives
the property on $BS(v_i)$. □

This actually means that if an edge exists in G(S), then so do all edges
which are nested within it and are incident into one of its extremes. The following property
relates the indices of the vertices $v_j \in FS(v_i)$ with the cost of their connecting edge $(v_i, v_j)$,
and not surprisingly shows that the smaller is j (i.e. shorter edge), the smaller is the cost of
encoding the parsed phrase $S[i : j-1]$.^3

^3 Recall that $c(v_i, v_j) = |f(d_{i,j})| + |g(\ell_{i,j})|$, if the edge does exist, otherwise we set $c(v_i, v_j) = +\infty$.



Fact 2 Given a vertex $v_i$, for any pair of vertices $v_{j'}, v_{j''} \in FS(v_i)$ such that $j' < j''$, we
have that $c(v_i, v_{j'}) \le c(v_i, v_{j''})$. The same property holds for $v_{j'}, v_{j''} \in BS(v_i)$.

Proof: We have that $d_{i,j'} \le d_{i,j''}$ and $\ell_{i,j'} < \ell_{i,j''}$ because $S[i : j'-1]$ is a prefix of $S[i : j''-1]$
and thus the first substring occurs wherever the latter occurs. The property holds because f
and g satisfy the Increasing Cost Property 1. □

Given these monotonicity properties, we are ready to characterize a special subset of the 
vertices in FS{vi), and their connecting edges. 

Definition 1. An edge $(v_i, v_j) \in E$ is called

- d-maximal iff the next edge from $v_i$ takes more bits to encode its distance. Namely, we
  have that $|f(d_{i,j})| < |f(d_{i,j+1})|$.
- ℓ-maximal iff the next edge from $v_i$ takes more bits to encode its length. Namely, we have
  that $|g(\ell_{i,j})| < |g(\ell_{i,j+1})|$.

Overall, we say that edge $(v_i, v_j)$ is maximal if it is either d-maximal or ℓ-maximal: thus
$c(v_i, v_j) < c(v_i, v_{j+1})$.

Now, we wish to count the number of maximal edges outgoing from $v_i$. This number
clearly depends on the integer-encoding functions f and g (which satisfy Property 1). Let
Q(f, n) (resp. Q(g, n)) be the number of different codeword lengths generated by f (resp.
g) when applied to integers in the range [n]. We can partition [n] in contiguous sub-ranges
$I_1, I_2, \ldots, I_{Q(f,n)}$ such that the integers in $I_i$ are mapped by f to codewords (strictly) shorter
than the codewords for the integers in $I_{i+1}$. Similarly, g partitions the range [n] in Q(g, n)
contiguous sub-ranges.

Lemma 2. There are at most Q(f, n) + Q(g, n) maximal edges outgoing from any vertex $v_i$.

Proof: By Fact 1, vertices in $FS(v_i)$ have indices in a range R, and by Fact 2, $c(v_i, v_j)$ is
monotonically non-decreasing as j increases in R. Moreover we know that the codeword length
of f (resp. g) cannot change more than Q(f, n) (resp. Q(g, n)) times, so that the statement follows. □
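
The quantity Q(f, n) and the associated sub-ranges are easy to compute for a concrete encoder. The sketch below does so for the Elias gamma code; the function name `codeword_classes`, the choice of the range 0, 1, ..., n, and the x + 1 shift of the gamma code are assumptions made here for illustration.

```python
def gamma_len(x):
    """Bits of the Elias gamma code of x + 1 (shifted so that 0 is encodable)."""
    return 2 * (x + 1).bit_length() - 1


def codeword_classes(f, n):
    """Split 0..n into maximal runs (first, last, bits) of equal codeword length."""
    classes, start = [], 0
    for x in range(1, n + 2):
        if x == n + 1 or f(x) != f(start):
            classes.append((start, x - 1, f(start)))
            start = x
    return classes


print(codeword_classes(gamma_len, 10))
# [(0, 0, 1), (1, 2, 3), (3, 6, 5), (7, 10, 7)]
print(len(codeword_classes(gamma_len, 1000)))        # 10, i.e. O(log n) classes
print(len(codeword_classes(lambda x: 32, 1000)))     # 1 for equal-length codewords
```

The number of classes is the value Q(f, n) that bounds the d-maximal (or ℓ-maximal) edges leaving a single vertex.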

In order to speed up the computation of a SSSP connecting $v_1$ to $v_{n+1}$ in G(S), we construct
a subgraph $\widetilde{G}(S)$ which is provably smaller than G(S) and contains one of those SSSP. The next
theorem shows that $\widetilde{G}(S)$ can be formed by taking just the maximal edges of G(S).

Theorem 1. There exists a shortest path in G(S) connecting $v_1$ to $v_{n+1}$ and traversing only
maximal edges.

Proof: By contradiction assume that every such shortest path contains at least one non-maximal
edge. Let $\pi = v_{i_1} v_{i_2} \cdots v_{i_k}$, with $i_1 = 1$ and $i_k = n + 1$, be one of these shortest
paths, and let $\gamma = v_{i_1} \ldots v_{i_r}$ be the longest initial subpath of $\pi$ which traverses only maximal
edges. Assume w.l.o.g. that $\pi$ is the shortest path maximizing the value of $|\gamma|$. We know
that $(v_{i_r}, v_{i_{r+1}})$ is a non-maximal edge, and thus we can take the maximal edge $(v_{i_r}, v_j)$ that
has the same cost. By definition of maximal edge, it is $j > i_{r+1}$; furthermore, we must have
$j < n + 1$ because we assumed that no path is formed only by maximal edges. Thus there must
exist an index $i_h > i_r$ such that $j \in [i_h, i_{h+1}]$, because indices in $\pi$ are increasing given that
G(S) is a DAG. Since $(v_{i_h}, v_{i_{h+1}})$ is an edge of $\pi$, by Fact 1 it follows that the
edge $(v_j, v_{i_{h+1}})$ exists, and by Fact 2 on $BS(v_{i_{h+1}})$ we can conclude that $c(v_j, v_{i_{h+1}}) \le c(v_{i_h}, v_{i_{h+1}})$.
Consequently, the path $v_{i_1} \cdots v_{i_r} v_j v_{i_{h+1}} \cdots v_{i_k}$ is also a shortest path but its longest initial
subpath of maximal edges consists of $|\gamma| + 1$ vertices, which is a contradiction! □



Optimal-Parser(S[1, n])

1. C[1] = 0; P[1] = 1;
2. for each i ∈ [2, n + 1] do C[i] = +∞; P[i] = NIL;
3. for i = 1 to n do
4.     generate on-the-fly all maximal edges in FS(v_i);
5.     for any (v_i, v_j) maximal do
6.         if C[j] > C[i] + c(v_i, v_j) then C[j] = C[i] + c(v_i, v_j); P[j] = i;

Fig. 1. Algorithm to compute the SSSP of the subgraph $\widetilde{G}(S)$.

Theorem 1 implies that the distance between $v_1$ and $v_{n+1}$ is the same in $\widetilde{G}(S)$ and G(S),
with the advantage that computing distances in $\widetilde{G}(S)$ can be done faster and in reduced space,
because of its smaller size. In fact, Lemma 2 implies that $|FS(v)| \le Q(f, n) + Q(g, n)$ for
every vertex v of $\widetilde{G}(S)$, so that

Lemma 3. Subgraph $\widetilde{G}(S)$ consists of n + 1 vertices and at most n(Q(f, n) + Q(g, n)) edges.

For Elias' codes [6], Fibonacci's codes [7], and most practical integer encoders used for
search engines and data compressors [17, 22], it is $Q(f, n) = Q(g, n) = O(\log n)$. Therefore
$|\widetilde{G}(S)| = O(n \log n)$ and hence it is provably smaller than the complete graph built and used
by the previous papers [18, 9, 8].

The next technical step consists of achieving time efficiency and optimality in working
space, because we cannot construct $\widetilde{G}(S)$ all at once. In the next sections we design an
algorithm that generates $\widetilde{G}(S)$ on-the-fly as the computation of its SSSP goes on, and pays
O(1) amortized time per edge and no more than O(n) optimal space overall. In some sense,
this algorithm is optimal for the identified subgraph $\widetilde{G}(S)$.

5 A bit-optimal parser 

From a high level, our solution proceeds as in Figure 1, where a variant of a classic linear-time 
algorithm for SSSP over a DAG is reported [5, Section 24.2]. In that pseudo-code entries C[i] 
and P[i] hold, respectively, the shortest path distance from $v_1$ to $v_i$ and the predecessor of $v_i$
in that shortest path. The main idea of Optimal-Parser consists of scanning the vertices of
G(S) in topological order, and of generating on-the-fly and relaxing (Step 6) the edges outgoing
from a vertex $v_i$ only when $v_i$ becomes the current vertex. The correctness of Optimal-Parser
follows directly from Theorem 24.5 of [5] and our Theorem 1.
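
The relaxation scheme of Figure 1 can be sketched in Python as follows. This is not the paper's algorithm: the forward star is produced by a brute-force scan instead of the on-the-fly generation developed in the next section, indices are 0-based, and all names (`forward_star`, `maximal_edges`, `optimal_parser`, `LITERAL_BITS`) are hypothetical. Two modelling assumptions are made here so that edge costs are monotone as required by Fact 2: f and g are the Elias gamma code of x + 1, and a single-symbol edge is treated as a copy of length 1 whenever the symbol has already occurred, with a flat surcharge only for the first occurrence of a symbol.

```python
def gamma_len(x):
    """Bits of the Elias gamma code of x + 1 (shifted so that 0 is encodable)."""
    return 2 * (x + 1).bit_length() - 1


LITERAL_BITS = 8  # assumed flat surcharge for the first occurrence of a symbol


def forward_star(s, i):
    """Edges leaving vertex i as (j, d, l, extra_bits), sorted by j (brute force)."""
    n, dist = len(s), {}
    for p in range(i):                      # later p overwrites: right-most copy wins
        l = 0
        while i + l < n and s[p + l] == s[i + l]:
            l += 1
        for k in range(1, l + 1):
            dist[k] = i - p
    if not dist:                            # new symbol: only the literal edge exists
        return [(i + 1, 0, 1, LITERAL_BITS)]
    return [(i + k, dist[k], k, 0) for k in sorted(dist)]


def maximal_edges(edges, f, g):
    """Keep an edge iff the next one needs more bits for d or for l (Definition 1)."""
    keep = []
    for t, (j, d, l, extra) in enumerate(edges):
        last = t + 1 == len(edges)          # a missing next edge has cost +infinity
        if last or f(d) < f(edges[t + 1][1]) or g(l) < g(edges[t + 1][2]):
            keep.append((j, d, l, extra))
    return keep


def optimal_parser(s, f=gamma_len, g=gamma_len, only_maximal=True):
    """Return (bits, phrases) of a cheapest parsing; phrases are (d, l) pairs."""
    n = len(s)
    C = [0] + [float("inf")] * n            # C[j]: cheapest cost to reach vertex j
    P = [None] * (n + 1)                    # P[j]: (predecessor, d, l) on that path
    for i in range(n):                      # vertices in topological order
        edges = forward_star(s, i)
        if only_maximal:
            edges = maximal_edges(edges, f, g)
        for j, d, l, extra in edges:
            w = f(d) + g(l) + extra
            if C[i] + w < C[j]:             # relaxation (Step 6 of Figure 1)
                C[j], P[j] = C[i] + w, (i, d, l)
    phrases, j = [], n
    while j > 0:                            # walk the predecessors back to the source
        i, d, l = P[j]
        phrases.append((d, l))
        j = i
    return C[n], phrases[::-1]


print(optimal_parser("aaaa"))               # (20, [(0, 1), (1, 3)])
```

Passing `only_maximal=False` relaxes every edge of the full graph; under the assumptions above both settings return the same number of bits, as Theorem 1 predicts.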

The key difficulty in this process consists of how to generate on-the-fly and efficiently (in
time and space) the maximal edges outgoing from vertex $v_i$. We will refer to this problem as
the forward-star generation problem, and use FSG for brevity. In the next section we show
that, when $\sigma \le n$, FSG takes O(1) amortized time per edge and O(n) space in total. As a
result (Theorem 2), Optimal-Parser requires $O(n(Q(f, n) + Q(g, n)))$ time in the worst case,
since the main loop is repeated n times and we have no more than Q(f, n) + Q(g, n) maximal
edges per vertex (Lemma 2). The space used is that for the FSG-computation plus the two
arrays C and P; hence, it will be shown to be O(n) in total. In case of a large alphabet $\sigma > n$,
we need to add $T_{sort}(n, \sigma)$ time because of the sorting/remapping of S's symbols into [n].



Theorem 2. Given a string S[1, n] drawn from an alphabet of size σ, and two integer-encoding
functions f and g that satisfy Property 1, there exists an LZ77-based compression
algorithm that computes the (f, g)-optimal parsing of S in $O(n(Q(f, n) + Q(g, n)) + T_{sort}(n, \sigma))$
time and O(n) space in the worst case.

5.1 On-the-fly generation of d-maximal edges 

We concentrate only on the computation of the d-maximal edges, because this is the hardest 
task. In fact, we know that the edges outgoing from $v_i$ can be partitioned in no more than
Q(f, n) groups according to the distance from S[i] of the copied string they represent (proof
of Lemma 2). Let $I_1, I_2, \ldots, I_{Q(f,n)}$ be the intervals of distances such that all distances in $I_k$
are encoded with the same number of bits by f. Take now the d-maximal edge $(v_i, v_{h_k})$ for
the interval $I_k$. We can infer that substring $S[i : h_k - 1]$ is the longest substring having a copy
at distance within $I_k$ because, by Definition 1 and Fact 2, any edge following $(v_i, v_{h_k})$ denotes
a longer substring which must lie in a subsequent interval (by d-maximality of $(v_i, v_{h_k})$), and
thus must have longer distance from S[i]. Once d-maximal edges are known, the computation
of the ℓ-maximal edges is then easy because it suffices to further decompose the edges between
successive d-maximal edges, say between $(v_i, v_{h_{k-1}+1})$ and $(v_i, v_{h_k})$, according to the distinct
values assumed by the encoding function g on the lengths in the range $[h_{k-1}, \ldots, h_k - 1]$.
This takes O(1) time per ℓ-maximal edge, because it needs some algebraic calculations and
can then infer the corresponding copied substring as a prefix of $S[i : h_k - 1]$.

So, let us concentrate on the computation of d-maximal edges outgoing from vertex $v_i$. This
is based on two key ideas. The first idea aims at optimizing the space usage by achieving the
optimal O(n) working-space bound. It consists of proceeding in Q(f, n) passes, one per interval
$I_k$ of possible d-costs for the edges in G(S). During the kth pass, we logically partition the
vertices of G(S) in blocks of $|I_k|$ contiguous vertices, say $v_i, v_{i+1}, \ldots, v_{i+|I_k|-1}$, and compute
all d-maximal edges which spread from that block and have distance within $I_k$ (thus the same
d-cost $c(I_k)$). These edges are kept in memory until they are used by Optimal-Parser, and
discarded as soon as the first vertex of the next block, i.e. $v_{i+|I_k|}$, needs to be processed.
The next block of vertices is then fetched and the process repeats. Actually, all passes are
executed in parallel to guarantee that all d-maximal edges of $v_i$ are available when processing
it. There are $n/|I_k|$ distinct blocks, each vertex belongs to exactly one block at each pass,
and all of its d-maximal edges are considered in some pass (because they have d-cost in some
$I_k$). The space is $\sum_{k=1}^{Q(f,n)} |I_k| = O(n)$ because we keep one d-maximal edge per vertex at any
pass.

The second key idea aims at computing the d-maximal edges for that block of contiguous
vertices in $O(|I_k|)$ time and space. This is what we address in the rest of this paper, because
its solution will allow us to state that the time complexity of FSG is
$\sum_{k=1}^{Q(f,n)} \sum_{i=1}^{n/|I_k|} O(|I_k|) = O(n\,Q(f, n))$, namely O(1) amortized time per d-maximal edge. Combining this fact with the
above observation on the computation of the ℓ-maximal edges, we get Theorem 2.

So, let us assume that the alphabet size $\sigma \le n$, and consider the kth pass of FSG in which
we assume that $I_k = [l, r]$. Recall that all distances in $I_k$ can be f-encoded in the same number
of, say, $c(I_k)$ bits. Let $B = [i, i + |I_k| - 1]$ be the block of (indices of) vertices for which we wish
to compute on-the-fly the d-maximal edges of cost $c(I_k)$. This means that the d-maximal edge
from vertex $v_h$, $h \in B$, represents a phrase that starts at S[h] and has a copy whose starting
position is in the window $W_h = [h - r, h - l]$. Thus the distance of that copy can be f-encoded
in $c(I_k)$ bits, and so we will say that the edge has d-cost $c(I_k)$. Since this computation must
be done for all vertices in B, it is useful to consider $W_B = W_i \cup W_{i+|I_k|-1}$ which merges the
first and last window and thus spans all positions that can be the (copy-)reference of any
d-maximal edge outgoing from B. Note that $|W_B| = 2|I_k|$ (see Figure 2).

Fig. 2. The figure shows the interval B = [i, j] with $j = i + |I_k| - 1$, and the window $W_B$ and its two
halves $W'_B$ and $W''_B$.

The following fact is crucial to efficiently compute all these d-maximal edges via proper 
indexing data structures: 

Fact 3 If there exists a d-maximal edge outgoing from v_h having d-cost c(I_k), then this edge
can be found by determining a position s ∈ W_h whose suffix S_s shares the longest
common prefix (shortly, lcp) with S_h.

Proof: Among all positions s in W_h take one whose suffix S_s shares the maximum lcp with
S_h, and let q be the length of this lcp. Of course, there may exist many such positions; we
take just one of them. Then the edge (v_h, v_{h+q+1}) has d-cost c(I_k) and is d-maximal. In fact
any other position s' ∈ W_h induces the edge (v_h, v_{h+q'+1}), where q' < q is the length of the
lcp shared between S_{s'} and S_h. This edge cannot be d-maximal because its d-cost is still
c(I_k) but its length is shorter. □
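
Fact 3 can be checked directly on small inputs. The sketch below is a brute-force reference,
not the O(1)-amortized procedure developed in this section: it scans the window W_h = [h - r, h - l]
and returns a position with the longest lcp together with that length. It assumes 0-based
positions and clips the window at the start of S; the names lcp and
maximal_position_bruteforce are illustrative only.

```python
def lcp(S, a, b):
    """Length of the longest common prefix of the suffixes S[a:] and S[b:]."""
    k = 0
    while a + k < len(S) and b + k < len(S) and S[a + k] == S[b + k]:
        k += 1
    return k


def maximal_position_bruteforce(S, h, l, r):
    """Scan W_h = [h - r, h - l] and return (s, q): a position s whose suffix shares
    the longest lcp (of length q) with S[h:]. Returns (None, -1) on an empty window."""
    best, best_len = None, -1
    for s in range(max(0, h - r), h - l + 1):
        q = lcp(S, s, h)
        if q > best_len:          # keep the first position achieving the maximum
            best, best_len = s, q
    return best, best_len
```

For instance, on S = "abracadabra" and h = 7 the window [0, 6] (l = 1, r = 8) contains
position 0, whose suffix shares the prefix "abra" with S[7:]. Each call costs time proportional
to the window size times the lcp length, which is exactly the cost the trie-based approach
is designed to avoid.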

In the rest of the paper we will call the position s of Fact 3 the maximal position for vertex
v_h. The maximal position for v_h exists only if v_h has a d-maximal edge of cost c(I_k).
Therefore we design an algorithm which computes the maximal positions of every vertex v_h
in B whenever they exist, and otherwise assigns an arbitrary position to v_h. The net
result will be that we will be generating a supergraph of G(S) which is still guaranteed to have
the size stated in Lemma 2 and can be created efficiently in O(|I_k|) time and space, as we
required above.

Fact 3 has related the computation of maximal positions for the vertices in B to
lcp-computations between suffixes in B and suffixes in W_B. Therefore it is natural to resort to some
indexing data structure, like the compact trie T_B, built over the suffixes of S which start in
the range of positions B ∪ W_B. Trie T_B takes O(|B| + |W_B|) = O(|I_k|) space, and that is
within our required space bounds. We then notice that the maximal position s for a vertex
v_h in B having d-cost c(I_k) can be computed by finding the leaf of T_B which is labeled with
an index s ∈ W_h and has the deepest lowest common ancestor (shortly, lca) with the leaf
labeled h. We need to answer this query in O(1) amortized time per vertex v_h, since we aim
at achieving an O(|I_k|) time complexity over all vertices in B. However, this is tricky. In fact
this is not the classic lca-query because we do not know s, which is actually the position
we are searching for. Furthermore, since the leaf s is the closest one to h in T_B among the
leaves with index in W_h, one could think to use proper predecessor/successor queries on a
suitable dynamic set of suffixes in W_h. Unfortunately, this would take ω(1) time because of
well-known lower bounds [1]. Therefore, answering this query in constant (amortized) time
per vertex requires devising and deploying proper structural properties of the trie T_B and the
problem at hand. This is what we do in our algorithm, whose underlying intuition follows.

Let u be the lca of the leaves h and s in T_B. For simplicity, we assume that interval W_h
strictly precedes B and that s is the unique maximal position for v_h (our algorithm deals with
the remaining cases too, see the proof of Lemma 4). We observe that h must be the smallest index
that lies in B and labels a leaf descending from u in T_B. In fact assume, by contradiction,
that a smaller index h' < h does exist. By definition h' ∈ B and thus v_h would not have
a d-maximal edge of d-cost c(I_k) because it could copy from the closer h' a possibly longer
phrase, instead of copying from the farther set of positions in W_h. This observation implies
that we have to search only for one maximal position per node u of T_B, and this position
refers to the vertex v_{a(u)} whose index a(u) is the smallest one that lies in B and labels a leaf
descending from u. Computing a-values clearly takes O(|T_B|) = O(|I_k|) time and space.

Now we need to compute the maximal position for v_{a(u)}, for each node u ∈ T_B. We cannot
traverse the subtree of u searching for the maximal position of v_{a(u)}, because this would take
quadratic time. Instead, we define W'_B and W''_B to be the first and the second
half of W_B, respectively, and observe that any window W_h has its left extreme in W'_B and its
right extreme in W''_B. (See Figure 2 for an illustrative example.) Therefore the window W_{a(u)}
containing the maximal position s for v_{a(u)} overlaps both W'_B and W''_B. So if s does exist,
then s belongs to either W'_B or to W''_B, and leaf s descends from u. Therefore, the maximum
(resp. minimum) among the elements in W'_B (resp. W''_B) that label leaves descending from u
must belong to W_{a(u)}. This suggests to compute min(u) and max(u) as the rightmost position
in W'_B and the leftmost position in W''_B that label leaves descending from u, respectively.
These values can be computed in O(|I_k|) time by a post-order visit of T_B.

We are now ready to compute mp[h] as the maximal position for v_h, if it exists, or otherwise
set mp[h] arbitrarily. We initially set all mp's entries to nil; then we visit T_B in post-order
and perform, at each node u, the following checks whenever mp[a(u)] = nil:

- If min(u) ∈ W_{a(u)}, set mp[a(u)] = min(u).

- If max(u) ∈ W_{a(u)}, set mp[a(u)] = max(u).

At the end of the visit, if mp[a(u)] is still nil we set mp[a(u)] = a(parent(u)) whenever
a(u) ≠ a(parent(u)). This last check is needed (proof below) to manage the case in which
S[a(u)] can copy the phrase starting at its position from position a(parent(u)) and, additionally,
we have that B overlaps W_B (which may occur depending on f). Since T_B has size
O(|I_k|), the overall algorithm requires O(|I_k|) time and space in the worst case, as required.
The following lemma proves its correctness:

Lemma 4. For each position h ∈ B, if there exists a d-maximal edge outgoing from v_h and
having d-cost c(I_k), then mp[h] is equal to its maximal position.

Proof: Recall that B = [i, i + |I_k| - 1] and consider the longest path π = u_1 u_2 ... u_z in
T_B that starts from the leaf u_1 labeled with h ∈ B and goes upward as long as the traversed
nodes satisfy a(u_j) = h, here j = 1, ..., z. By definition of a-value, we know that all other leaves
descending from u_z and occurring in B are labeled with an index which is larger than h.
Clearly, a(parent(u_z)) < h (if any). There are two cases for the final value stored in mp[h].

Suppose that mp[h] ∈ W_h. We want to prove that mp[h] is the index of the leaf which has
the deepest lca with h among all the other leaves in W_h. Let u_x ∈ π be the node in which
the value of mp[h] is assigned (it is a(u_x) = h). Assume that there exists at least another
index in W_h whose leaf has a deeper lca with leaf h. This lca must lie on u_1 ... u_{x-1}, say
u_l. Since W_h is an interval having its left extreme in W'_B and its right extreme in W''_B, the
value max(u_l) or min(u_l) must lie in W_h and thus the algorithm would already have set mp[h] to
one of these positions at u_l, because of the post-order visit of T_B, contradicting the choice of
u_x. Therefore mp[h] must be the index of the leaf having the deepest lca with h, and thus by
Fact 3 is its maximal position (if any).

Now suppose that mp[h] ∉ W_h and, thus, it cannot be a maximal position for v_h. We
have to prove that there does not exist a d-maximal edge outgoing from the vertex v_h with
cost c(I_k). Let S_s be the suffix in W_h having the maximum lcp with S_h, and let l be the
lcp-length. Values min(u_i) and max(u_i) do not belong to W_h, for any node u_i ∈ π (with
a(u_i) = h), otherwise mp[h] would have been assigned with an index in W_h (contradicting
the hypothesis). The value of mp[h] remains nil up to node u_z. This implies that no suffix
descending from u_z starts in W_h and, in particular, S_s does not descend from u_z. Therefore,
the lca between leaves h and s is a node in the path from parent(u_z) to the root, and
lcp(S_{a(parent(u_z))}, S_h) > lcp(S_s, S_h) = l. Since a(parent(u_z)) < a(u_z) and belongs to B, it is
nearer to h than any other position in W_h, and shares a longer prefix with S_h. So we found
a longer edge from v_h with smaller d-cost. This implies that v_h has no d-maximal edge of cost
c(I_k) in G(S). □
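
The post-order procedure analysed in Lemma 4 can be sketched as follows. This is a hedged
illustration under several simplifying assumptions that are not in the paper: positions are
0-based, T_B is replaced by a plain uncompacted trie (so the sketch is not linear-time), W_B
strictly precedes B (that is, 2l > r), the window does not cross the start of S, and the final
a(parent(u)) assignment for the overlapping case is omitted, so vertices without a maximal
position keep the value None. The names Node, maximal_positions, first and second are
illustrative.

```python
class Node:
    """Node of a plain (uncompacted) trie; `leaf` is the start of the suffix ending here."""
    def __init__(self):
        self.children = {}
        self.leaf = None


def maximal_positions(S, i, l, r):
    """Return mp[h] for every h in B = [i, i + r - l], using windows W_h = [h - r, h - l]."""
    size = r - l + 1                               # |I_k|
    # Simplifying assumptions of this sketch (see the lead-in).
    assert 2 * l > r and i - r >= 0 and i + size <= len(S)
    B = range(i, i + size)
    first = range(i - r, i - l + 1)                # W'_B
    second = range(i - l + 1, i + size - l)        # W''_B
    T = S + "\0"                                   # unique terminator: each suffix ends in a leaf
    root = Node()
    for p in list(first) + list(second) + list(B):
        node = root
        for ch in T[p:]:
            node = node.children.setdefault(ch, Node())
        node.leaf = p

    mp = {h: None for h in B}

    def visit(u):
        # a: smallest index of B below u; lo: rightmost position of W'_B below u;
        # hi: leftmost position of W''_B below u.
        a = lo = hi = None
        p = u.leaf
        if p is not None:
            if p in B:
                a = p
            elif p in first:
                lo = p
            else:
                hi = p
        for child in u.children.values():
            ca, clo, chi = visit(child)
            if ca is not None and (a is None or ca < a):
                a = ca
            if clo is not None and (lo is None or clo > lo):
                lo = clo
            if chi is not None and (hi is None or chi < hi):
                hi = chi
        if a is not None and mp[a] is None:
            for cand in (lo, hi):                  # the two checks performed at node u
                if cand is not None and a - r <= cand <= a - l:
                    mp[a] = cand
        return a, lo, hi

    visit(root)
    return mp
```

On S = "abcabcabcabcab" with i = 8, l = 4 and r = 7, the vertices 8, 9 and 10 receive the
positions 2, 3 and 4, while vertex 11 keeps None: the closer position 8 in B already shares the
prefix "cab" with S[11:], so no d-maximal edge of this cost leaves vertex 11.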

We are left with the problem of building T_B in O(|I_k|) time and space, thus a time
complexity which is independent of the length of the indexed suffixes and the alphabet size.
In Appendix A we show how to achieve this result by deploying the fact that the above
algorithm does not make any assumption on the ordering of T_B, because it just computes
(sort of) lca-queries on its structure. This is the last step to prove Theorem 2.



6 Conclusions 

Our parsing scheme can be extended to variants of LZ77 which deploy parsers that refer to
a bounded compression-window (the typical scenario of gzip and its derivatives [17]). In this
case, LZ77 selects the next phrase by looking only at the most recent w input symbols. Since
w is usually a constant chosen as a power of 2 of a few Kbs [17], the running time of our
algorithm becomes O(n Q(g, n)), since Q(f, w) is a constant. We notice that the remaining
term could further be refined by considering the length ℓ of the longest repeated substring in
S, and state the time complexity as O(n Q(g, ℓ)). If S is generated by an ergodic source [20]
and g is taken to be the classic Elias' code, then Q(g, ℓ) = O(log log n), so that our algorithm
takes O(n log log n) time and O(n) space for this class of strings.

We finally notice that, although we have mainly dealt with the LZ77-dictionary, the
techniques presented in this paper could be extended to design efficient bit-optimal compressors
for other on-line dictionary construction schemes, like LZ78. Intuitively, we can show that
Theorem 1 still holds for any suffix- or prefix-complete dictionary under the hypothesis that
the codewords assigned to each suffix or prefix of a dictionary phrase w are shorter than the
codewords assigned to w itself. In this case the notion of edge maximality (Definition 1) can
be generalized by calling an edge (v_i, v_j) maximal iff all longer edges, say (v_i, v_{j'}) with j < j',
have not larger cost, namely w_{i,j'} ≤ w_{i,j}. In this case, we can provide an efficient bit-optimal
parser for the LZ78-dictionary (details in the full paper).

The main open question is to extend our results to statistical encoding functions like
Huffman or Arithmetic coders applied on the integral range [n] [22]. They do not necessarily
satisfy Property 1 because it might be the case that |f(x)| > |f(y)|, whenever the integer y
occurs more frequently than the integer x in the parsing of S. We argue that it is not trivial to
design a bit-optimal compressor for these encoding functions because their codeword lengths
change as the set of distances and lengths used in the parsing process changes.

Practically, we would like to implement our optimal parser motivated by the encouraging 
experimental results of [8], which have improved the standard LZ77 by a heuristic that tries 
to optimize the encoding cost of just phrases' lengths. 
APPENDIX A. Building T_B optimally



We are left with the problem of building T_B in O(|I_k|) time and space, thus a time complexity
which is independent of the length of the indexed suffixes and the alphabet size. We show
how to achieve this result by deploying the crucial fact that the algorithm of Section 5.1 to
compute the d-maximal edges does not make any assumption on the ordering of edges in T_B,
because it just computes (sort of) lca-queries on its structure. This is the last step needed
to complete the proof of Theorem 2, and we give its algorithmic details here.

At preprocessing time we build the suffix array of the whole string S and a data structure
that answers constant-time lcp-queries between pairs of suffixes (see e.g. [15]). These data
structures can be built in O(n) time and space, when σ = O(n). For larger alphabets, we
need to add T_sort(n, σ) time, which takes into account the cost of sorting the symbols of S
and re-mapping them to [n] (see Theorem 2).

Let us first assume that B and W_B are contiguous and form the range [i, i + 3|I_k| - 1].
If we had the sorted sequence of suffixes starting in S[i, i + 3|I_k| - 1], we could easily build
T_B in O(|I_k|) time and space by deploying the above lcp-data structure. Unfortunately, it is
unclear how to obtain from the suffix array of the whole S, the sorted sub-sequence of suffixes
starting in the range [i, i + 3|I_k| - 1] by taking O(|B| + |W_B|) = O(|I_k|) time (notice that
these suffixes have length O(n - i)). We cannot perform a sequence of predecessor/successor
queries because they would take ω(1) time each [1]. Instead, we resort to the key observation
above that T_B does not need to be ordered, and thus devise a solution which builds an
unordered T_B in O(|I_k|) time and space, passing through the construction of the suffix array
of a transformed string. The transformation is simple. We first map the distinct symbols of
S[i, i + 3|I_k| - 1] to the first O(|I_k|) integers. This mapping does not need to reflect their
lexicographic order, and thus can be computed in O(|I_k|) time by a simple scan of those
symbols and the use of a table M of size σ < n. Then, we define S^M as the string S which
has been transformed by re-mapping some of the symbols according to table M (namely,
those occurring in S[i, i + 3|I_k| - 1]). We can prove that

Lemma 5. Let S_i, ..., S_j be a contiguous sequence of suffixes in S. The remapped suffixes
S^M_i, ..., S^M_j can be lexicographically sorted in O(j - i + 1) time.

Proof: Consider the string of pairs w = (S^M[i], b_i) ... (S^M[j], b_j)$, where b_h is 1 if
S^M_{h+1} > S^M_{j+1}, -1 if S^M_{h+1} < S^M_{j+1}, or 0 if h = j. The ordering of the pairs is defined
component-wise, and we assume that $ is a special "pair" larger than any other pair in w. For any pair of
indices p, q ∈ [1 ... j - i], it is S^M_{p+i} > S^M_{q+i} iff w_p > w_q. In fact, suppose that w_p > w_q and set
r = lcp(w_p, w_q). We have that w[p + r] = (S^M[p + i + r], b_{p+i+r}) > (S^M[q + i + r], b_{q+i+r}) =
w[q + r]. Hence S^M_{p+i+r} > S^M_{q+i+r}, by definition of the b's. Therefore S^M_{p+i} > S^M_{q+i}, since
their first r symbols are equal. This implies that sorting S^M_i, ..., S^M_j reduces to computing
the suffix array of w, and this takes O(|w|) time given that the alphabet size is O(|w|) [15].
Clearly, w can be constructed in that time bound because comparing S^M_{h+1} with S^M_{j+1} takes
O(1) time via an lcp-query on S (using the proper data structure above) and a check at their
first mismatch. □
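
The reduction used in the proof of Lemma 5 can be illustrated by the sketch below. It is a
minimal, non-linear-time illustration under stated assumptions: positions are 0-based, the
remapping through table M is skipped, the signs b_h are obtained by direct string comparison
rather than by constant-time lcp-queries, and the suffixes of w are ordered with a
comparison-based sort instead of a linear-time suffix-array construction. The names
sort_suffixes_via_pairs and TOP are hypothetical.

```python
def sort_suffixes_via_pairs(S, i, j):
    """Return the positions i..j ordered by the lexicographic order of the suffixes S[p:],
    by sorting the suffixes of the short string of pairs w instead of the long suffixes."""
    tail = S[j + 1:]                                   # the suffix S_{j+1}

    def sign(h):
        """b_h: +1 / -1 according to how S_{h+1} compares with S_{j+1}; 0 for h = j."""
        if h == j:
            return 0
        suffix = S[h + 1:]
        return (suffix > tail) - (suffix < tail)

    w = [(S[h], sign(h)) for h in range(i, j + 1)]     # pairs are compared component-wise
    TOP = (chr(0x10FFFF), 2)                           # plays the role of "$", larger than any pair
    return sorted(range(i, j + 1), key=lambda p: w[p - i:] + [TOP])
```

For S = "mississippi", i = 2 and j = 8, the order obtained from w coincides with the order of
the full suffixes, namely 7, 4, 8, 6, 3, 5, 2. The point of the construction is that w has
only j - i + 1 pairs, whereas the suffixes themselves may be much longer.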

Lemma 5 allows us to generate the compact trie of S^M_i, ..., S^M_{i+3|I_k|-1}, which is equal to
the (unordered) compacted trie of S_i, ..., S_{i+3|I_k|-1} after replacing every ID assigned by M
with its original symbol in S. We finally notice that if B and W_B are not contiguous (as
instead we assumed above), we can use a similar strategy to sort separately the suffixes in
B and the suffixes in W_B, and then merge these two sequences together by deploying the
lcp-data structure mentioned at the beginning of this section.



